Coursebook: Data Wrangling and Visualization
This coursebook is intended for participants who have completed the preceding courses offered in the Data Science in Python Specialization. This is the third course, Data Wrangling and Visualization.
The coursebook focuses on:
- plotly
- pandas object

The final part of this course is a Graded Assignment, where you are expected to apply all that you've learned on a new dataset and attempt the given questions.
You will need to use pip install <library_name> to install any libraries listed below that are not already installed on your machine. You then load the libraries into your workspace using import:
import pandas as pd
import plotly as pl
import plotly.express as px
promotion = pd.read_csv('data_input/promotion_clean.csv')
promotion.head()
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | date_of_birth | age | previous_year_rating | join_date | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | Female | sourcing | 1 | 1986-10-21 | 35 | 5.0 | 2013-10-11 | 8 | Yes | No | 49 | No |
| 1 | 65141 | Operations | region_22 | Bachelor's | Male | other | 1 | 1991-11-21 | 30 | 5.0 | 2017-09-19 | 4 | No | No | 60 | No |
| 2 | 7513 | Sales & Marketing | region_19 | Bachelor's | Male | sourcing | 1 | 1987-09-14 | 34 | 3.0 | 2014-05-29 | 7 | No | No | 50 | No |
| 3 | 2542 | Sales & Marketing | region_23 | Bachelor's | Male | other | 2 | 1982-02-17 | 39 | 1.0 | 2011-06-29 | 10 | No | No | 50 | No |
| 4 | 48945 | Technology | region_26 | Bachelor's | Male | other | 1 | 1976-02-22 | 45 | 3.0 | 2019-07-30 | 2 | No | No | 73 | No |
The information of the dataset:
- employee_id: Unique ID for employee.
- department: Department of employee.
- region: Region of employment (unordered).
- education: Education level.
- gender: Gender of employee.
- recruitment_channel: Channel of recruitment for employee.
- no_of_trainings: Number of other trainings completed in the previous year on soft skills, technical skills, etc.
- date_of_birth: Employee date of birth.
- age: Age of employee.
- join_date: Employee join date.
- previous_year_rating: Employee rating for the previous year.
- length_of_service: Length of service in years.
- KPIs_met >80%: Whether the percentage of KPIs (Key Performance Indicators) met is above 80% (Yes/No).
- awards_won?: Whether awards were won during the previous year (Yes/No).
- avg_training_score: Average score in current training evaluations.
- is_promoted: Recommended for promotion.

promotion.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54808 entries, 0 to 54807
Data columns (total 16 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   employee_id           54808 non-null  int64
 1   department            54808 non-null  object
 2   region                54808 non-null  object
 3   education             54808 non-null  object
 4   gender                54808 non-null  object
 5   recruitment_channel   54808 non-null  object
 6   no_of_trainings       54808 non-null  int64
 7   date_of_birth         54808 non-null  object
 8   age                   54808 non-null  int64
 9   previous_year_rating  54808 non-null  float64
 10  join_date             54808 non-null  object
 11  length_of_service     54808 non-null  int64
 12  KPIs_met >80%         54808 non-null  object
 13  awards_won?           54808 non-null  object
 14  avg_training_score    54808 non-null  int64
 15  is_promoted           54808 non-null  object
dtypes: float64(1), int64(5), object(10)
memory usage: 6.7+ MB
promotion.isna().sum()
employee_id             0
department              0
region                  0
education               0
gender                  0
recruitment_channel     0
no_of_trainings         0
date_of_birth           0
age                     0
previous_year_rating    0
join_date               0
length_of_service       0
KPIs_met >80%           0
awards_won?             0
avg_training_score      0
is_promoted             0
dtype: int64
Don't forget to check and change the data types:
promotion[['department','region','education',
'gender','recruitment_channel',
'KPIs_met >80%','awards_won?',
'is_promoted']] = promotion[['department','region',
'education','gender',
'recruitment_channel',
'KPIs_met >80%','awards_won?',
'is_promoted']].astype('category')
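# 'datetime64[ns]' states the unit explicitly; recent pandas versions reject the unit-less 'datetime64'
promotion[['date_of_birth','join_date']] = promotion[['date_of_birth','join_date']].astype('datetime64[ns]')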
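As a small illustration of the conversion step, the sketch below applies the same idea to a toy DataFrame (the values are made up for the example) and prints the resulting dtypes. It passes a dict to astype and uses pd.to_datetime for the date column; these are alternatives to the calls above, shown only as one possible way to write the conversion.

```python
import pandas as pd

# Toy data standing in for a few promotion columns (values are illustrative only)
toy = pd.DataFrame({
    'department': ['Technology', 'Finance', 'Technology'],
    'is_promoted': ['No', 'Yes', 'No'],
    'join_date': ['2019-07-30', '2014-05-29', '2017-09-19'],
})

# astype accepts a dict mapping column names to target dtypes
toy = toy.astype({'department': 'category', 'is_promoted': 'category'})

# Date strings become datetime64 values
toy['join_date'] = pd.to_datetime(toy['join_date'])

print(toy.dtypes)
```

Once join_date is a datetime column, accessors such as toy['join_date'].dt.year become available, which is what makes date-based filtering and grouping convenient later on.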
Plotly's Python graphing library makes interactive, publication-quality graphs. It can produce line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
The plotly.express module (usually imported as px) contains functions that can create entire figures at once, and is referred to as Plotly Express or PX. Plotly Express is a built-in part of the plotly library, and is the recommended starting point for creating most common figures. Every Plotly Express function uses graph objects internally and returns a plotly.graph_objects.Figure instance.
First, we need to subset our data so that it contains only employees whose promotion status is 'Yes':
promoted = promotion[promotion['is_promoted']=='Yes']
Then we can further preprocess our data into a more appropriate format for our visualization:
promot_gender = promoted['gender'].value_counts()
promot_gender
Male      3201
Female    1467
Name: gender, dtype: int64
Now, let's create a plotly figure using px.bar()
px.bar(promot_gender)

The plot above was produced by passing our data to the px.bar() function as a parameter. It returns a bar plot with the index on the x-axis and the values on the y-axis.
We can refine this iteratively until we have a visualization that suits our purpose. For example, to rename the x and y axis titles, we can add the labels parameter.
px.bar(promot_gender,
labels = {
'index': 'Gender',
'value': 'No of Promoted Employee',
'variable': ''
})

The labels parameter: by default, column names are used in the figure for axis titles, legend entries and hovers. This parameter allows this to be overridden. The keys of this dict should correspond to column names, and the values should correspond to the desired label to be displayed.
So, by now, we have seen how flexible it is to draw a visualization using px; this is a good place to get creative!
For now, let's move on to basic visual enhancements.
As we demonstrated earlier, plotly.express parameters are highly customizable, but that is also its caveat: sometimes the number of options is overwhelming! So, in this part, we will show some basic plotly.express elements to add or edit.
Let's start by adding some parameters and labels to make our visualization clearer:
promot_gender
Male      3201
Female    1467
Name: gender, dtype: int64
px.bar(promot_gender,
title = 'Which gender have the highest promotion rate?',
color_discrete_sequence = ['Navy'],
labels = {
'index': 'Gender',
'value': 'No of Promoted Employee',}
)

As we can see, the plot gets clearer as we adjust more of its parts.
Visualization is a powerful way to deliver context from our data. If we choose a good way to communicate that context, our audience will get the insight that we want to deliver.
In this part, we will cover some basic visualization contexts:

- Categorical ranking
- Data distribution
- Correlation
- Time-based evolution

Categorical ranking is one of the most basic ways to communicate how our categorical variable could show a different behaviour between its levels in terms of a numerical output.
We will cover the basic way to do a categorical ranking and how to further breakdown the insight using the promotion dataset.
General Ranking
To visualize a categorical ranking, we can use a bar plot to show differences in magnitude between the levels of our categorical variable.
For a start, let's look at the ranking of departments in terms of number of employees.
We will start by aggregating the data using groupby:
Syntax:
df.groupby([COLUMNS_TO_GROUP]).AGGFUNC()[[VALUES]]
# aggregation
data_agg = promotion.groupby(['department']).count()[['employee_id']]
# use reset_index to turn the index back into a regular column
data_agg = data_agg.reset_index()
# print the data
data_agg
| | department | employee_id |
|---|---|---|
| 0 | Analytics | 5352 |
| 1 | Finance | 2536 |
| 2 | HR | 2418 |
| 3 | Legal | 1039 |
| 4 | Operations | 11348 |
| 5 | Procurement | 7138 |
| 6 | R&D | 999 |
| 7 | Sales & Marketing | 16840 |
| 8 | Technology | 7138 |
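The groupby pattern above can be tried on a toy DataFrame to see each step in isolation. The sketch below uses invented rows, and also shows size() as one alternative way to count rows per group. Note that count() skips missing values in the counted column, while size() counts every row, so the two only agree when the column has no missing values.

```python
import pandas as pd

# Invented rows for illustration
toy = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5],
    'department': ['HR', 'Legal', 'HR', 'HR', 'Legal'],
})

# Pattern from the text: group, count, keep one column, reset the index
by_count = toy.groupby(['department']).count()[['employee_id']].reset_index()

# Alternative: size() counts rows per group directly
by_size = toy.groupby('department').size().reset_index(name='employee_id')

print(by_count)
print(by_size)
```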
Now, take a look at the visualization below:
# prepare visualization data
data_agg = data_agg.sort_values('employee_id', ascending=False)
# visualization
px.bar(
data_agg,
x = 'department',
y = 'employee_id',
title = 'Number of Employee by Department',
template='plotly_dark',
labels = {
'department' : 'Department',
'employee_id' : 'No of Employee'
}
)

A simple bar plot, if visualized properly, is really powerful for categorical ranking. Our plot is a good example: we can already see which department is the largest or the smallest, and we can also see the big picture of the ranking in terms of number of employees.
We should also take a note on how some additional information could help a lot in making our visualization more informative.
Breaking Down a Ranking
As in the previous example, visualizing a categorical ranking can help us gain some insight. Often, though, we need to break the ranking down further to learn more.
Let's try, for example, to re-visualize the ranking, this time broken down by promotion status:
data_agg = promotion.groupby(['department','is_promoted']).count()[['employee_id']].reset_index()
data_agg
| | department | is_promoted | employee_id |
|---|---|---|---|
| 0 | Analytics | No | 4840 |
| 1 | Analytics | Yes | 512 |
| 2 | Finance | No | 2330 |
| 3 | Finance | Yes | 206 |
| 4 | HR | No | 2282 |
| 5 | HR | Yes | 136 |
| 6 | Legal | No | 986 |
| 7 | Legal | Yes | 53 |
| 8 | Operations | No | 10325 |
| 9 | Operations | Yes | 1023 |
| 10 | Procurement | No | 6450 |
| 11 | Procurement | Yes | 688 |
| 12 | R&D | No | 930 |
| 13 | R&D | Yes | 69 |
| 14 | Sales & Marketing | No | 15627 |
| 15 | Sales & Marketing | Yes | 1213 |
| 16 | Technology | No | 6370 |
| 17 | Technology | Yes | 768 |
# prepare visualization data
data_agg = data_agg.sort_values(by = 'employee_id')
# visualization
px.bar(
data_agg,
x = 'employee_id',
y = 'department',
color = 'is_promoted',
color_discrete_sequence = ['darkslateblue','tomato'],
orientation='h',
template = 'ggplot2',
labels = {
'department': 'Department',
'employee_id': 'No of Employee',
'is_promoted': 'Is Promoted?',
}
)

This bar plot variation is called a stacked bar plot. It helps us see the ranking in general while also showing the share of another categorical variable inside each level. For example, we still see the original ranking, but we now also see that the Sales & Marketing department has the largest number of promoted employees (1,213). Relative to department size, however, its share is modest (1213 of 16840, about 7.2%); the highest proportion belongs to Technology (768 of 7138, about 10.8%). Raw counts and proportions can tell different stories, which is a crucial point in this context.
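A stacked bar shows counts, so proportions are easier to judge when computed explicitly. The sketch below is one way to do that, using the counts copied from the department and promotion status table above; the column names total and promotion_rate are introduced here for illustration.

```python
import pandas as pd

# Counts copied from the aggregated table above
counts = pd.DataFrame({
    'department': ['Analytics', 'Finance', 'HR', 'Legal', 'Operations',
                   'Procurement', 'R&D', 'Sales & Marketing', 'Technology'],
    'No': [4840, 2330, 2282, 986, 10325, 6450, 930, 15627, 6370],
    'Yes': [512, 206, 136, 53, 1023, 688, 69, 1213, 768],
})

# Share of promoted employees within each department
counts['total'] = counts['No'] + counts['Yes']
counts['promotion_rate'] = counts['Yes'] / counts['total']

# Rank departments by proportion rather than by raw count
ranked = counts.sort_values('promotion_rate', ascending=False)
print(ranked[['department', 'Yes', 'total', 'promotion_rate']].round(3))
```

Ranked this way, Technology comes first and Legal last, while Sales & Marketing, which has the most promoted employees in absolute terms, sits in the lower half.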
Sometimes, we also need to see the exact difference regarding the breakdown. For this purpose, we could use barmode='group' to show the relative height in the breakdown:
# visualization
px.bar(
data_agg,
x = 'employee_id',
y = 'department',
color = 'is_promoted',
color_discrete_sequence = ['darkslateblue','tomato'],
barmode = 'group',
orientation='h',
template = 'ggplot2',
labels = {
'department': 'Department',
'employee_id': 'No of Employee',
'is_promoted': 'Is Promoted?',
}
)

Data distribution is a slightly more statistical way to see how our numerical data is distributed within our sample dataset. One thing to note for this visualization: it only works for numerical variables (ideally continuous ones).
In this part, we will discuss how to properly visualize and interpret distribution visualizations using the promotion dataset.
Simple Numerical Distribution
The most straightforward way to visualize a data distribution is a histogram. A histogram is made by grouping the values of a numerical variable into bins, each covering its own range of values.
For example, let's see how length_of_service is distributed among the employees if we use 30 bins:
# visualize
px.histogram(
promotion,
x = 'length_of_service',
nbins = 30
)

The visualization above shows us that the most frequent length of service is around 2 to 3 years.
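To make the idea of binning concrete, here is a small sketch using numpy.histogram on ten made-up service lengths and three bins. It only illustrates how values are counted per bin; Plotly chooses its own bin edges, so this is not a reproduction of the figure above.

```python
import numpy as np

# Ten made-up lengths of service, in years
service_years = np.array([1, 2, 2, 3, 3, 3, 4, 5, 7, 10])

# Split the range 1..10 into 3 equal-width bins and count values in each
counts, edges = np.histogram(service_years, bins=3)

print(edges)   # bin boundaries: 1, 4, 7, 10
print(counts)  # values per bin: 6, 2, 2
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both. Changing the number of bins changes how coarse or detailed the picture of the distribution is.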
Breaking Down Numerical Distribution
Breaking down the numerical distribution can also give us more insight into our data; it is very useful for comparing how the distribution differs between the levels of a categorical variable.
There are several ways to break down a data distribution, and the choice depends on how many levels our categorical variable has.
If we only have two levels in the categorical variable, it is straightforward to map that variable to the histogram's color parameter:
# Visualize
px.histogram(
promotion,
x = 'length_of_service',
nbins = 30,
color = 'is_promoted',
color_discrete_sequence = ['tomato','darkslateblue'],
title = 'Length of Service Distribution',
template = 'ggplot2',
labels={
'length_of_service': 'Length of Service (years)',
'is_promoted': 'Is Promoted?',
}
)

After creating the two plots above, what is the difference between a bar plot and a histogram?
As we can see from our visualization, a suitable breakdown can explain more about our numerical data distribution.
But if we have more than two levels, a histogram quickly becomes messy. So instead of a histogram, we can use a box plot to show the data distribution and its breakdown.
Let's try an example by splitting the length of service distribution by department and promotion status:
# visualize
px.box(
promotion,
x = 'department',
y = 'length_of_service',
color = 'is_promoted',
color_discrete_sequence = ['tomato','darkslateblue'],
title = 'Length of Service Distribution',
template = 'ggplot2',
labels={
'length_of_service': 'Length of Service (years)',
'is_promoted': 'Is Promoted?',
'department': 'Department'
},
)

As we can see from our plot, breaking down by more categorical variables is very helpful for exploring further insight. For example, we can see a strong difference in the length of service distribution for the Finance and R&D departments, but the difference is not as strong for the other departments.
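A box plot summarizes each group by its quartiles, and the same numbers can be computed directly with pandas. The sketch below uses a toy DataFrame with invented values. Plotly's own quartile calculation may differ slightly from the pandas default, so treat it as an approximation of what the boxes show.

```python
import pandas as pd

# Invented values: five employees per promotion status
toy = pd.DataFrame({
    'is_promoted': ['No'] * 5 + ['Yes'] * 5,
    'length_of_service': [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
})

# First quartile, median, and third quartile for each group
summary = (
    toy.groupby('is_promoted')['length_of_service']
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
print(summary)
```

Here the 'No' group has quartiles 2, 3, 4 and the 'Yes' group has 4, 6, 8, which corresponds to the bottom edge, middle line, and top edge of each box.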
Correlation is another popular context that we can explore. It helps us examine the relation between the variation of two values.
In this part, we will discuss how to show a proper correlation visualization and its interpretation using the promotion dataset.
Between Continuous Variables
The most common form of correlation is between continuous numerical variables. It shows us whether the two variables share a pattern of variation, which is often very insightful for explaining our dataset.
For example, let's try to visualize how length of service relates to average training score for employees in the Technology department.
# subset the data only technology
tech_dept = promotion[promotion['department']=='Technology']
# visualize
px.scatter(
tech_dept,
x = 'avg_training_score',
y = 'length_of_service',
title = 'The Relation of Length of Service at the company and Average Training Score',
labels = {
'avg_training_score': 'Average training score',
'length_of_service': 'Length of Service'
}
)

As we can see from the plot above, there is no apparent relation between average training score and length of service; however long an employee has stayed at the company, their average training score does not seem to depend on it.
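A visual impression like this can be backed by a number. One common choice is the Pearson correlation coefficient, available as Series.corr in pandas. The sketch below uses invented values (not the promotion data) to contrast a score that does not move with length of service against one that rises steadily with it.

```python
import pandas as pd

# Invented values for illustration only
toy = pd.DataFrame({
    'length_of_service': [1, 2, 3, 4, 5],
    'score_flat': [60, 45, 70, 45, 60],    # no linear trend with service
    'score_rising': [50, 55, 60, 65, 70],  # rises steadily with service
})

# Pearson correlation: near 0 means no linear relation, near 1 a strong positive one
r_flat = toy['length_of_service'].corr(toy['score_flat'])
r_rising = toy['length_of_service'].corr(toy['score_rising'])

print(round(r_flat, 3), round(r_rising, 3))
```

On the real data, the same call on the Technology subset would be tech_dept['length_of_service'].corr(tech_dept['avg_training_score']); a value close to zero would agree with the scatter plot.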
It is also common to break down the information in our scatter plot. But this is relatively difficult to achieve, since a scatter plot gets messy very easily. So the most straightforward way is to make a separate plot for each category; in plotly.express, we can achieve this using the facet_col (or facet_row) parameter.
Let’s try to breakdown our scatter plot by promotion status.
px.scatter(
tech_dept,
x = 'length_of_service',
y = 'avg_training_score',
title = 'The Relation of Length of Service at the company and Average Training Score',
facet_col = 'is_promoted',
labels = {
'length_of_service' : 'Length of Service',
'avg_training_score' : 'Score',
'no_of_trainings': 'No. of Trainings',
'is_promoted': 'Is Promoted?'
}
)

Let's try to add more detail by coloring the points by KPI status and sizing them by number of trainings.
px.scatter(
tech_dept,
x = 'length_of_service',
y = 'avg_training_score',
title = 'The Relation of Length of Service at the company and Average Training Score',
color = 'KPIs_met >80%',
facet_col = 'is_promoted',
size = 'no_of_trainings',
color_discrete_sequence=['tomato','darkslateblue'],
labels = {
'length_of_service' : 'Length of Service',
'avg_training_score' : 'Score',
'KPIs_met >80%': 'KPIs met >80%?',
'no_of_trainings': 'No. of Trainings',
'is_promoted': 'Promoted Status'
}
)

As we can see, adding more features to our scatter plot can give more explanation.
Time-based evolution, or simply time series analysis, is widely used in every aspect of business and other technical domains. It helps us see more clearly how a numerical value changes over time.
# Data wrangling
data_2020 = promotion[promotion['join_date'] >= '2020-01-01']
data_agg = data_2020.groupby(['join_date']).count()['employee_id'].reset_index().head(10)
# Visualize
px.line(
data_agg,
x='join_date',
y='employee_id',
markers=True,
labels={
'join_date':'Join date',
'employee_id':'Number of employee'
}
)

From the time series plot, we can gain clearer insight into how many employees joined on each date. For interpretation, we should focus on the time series components: trend, seasonality, and shocks. In this short window (the first ten join dates from 2020 onward), no obvious seasonal pattern is visible, so we can set seasonality aside here.
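Daily counts can be noisy, and trend or seasonality is usually easier to judge over a longer span at a coarser granularity. As one option (not part of the original analysis), the sketch below counts joiners per month on a toy DataFrame with invented dates; the column name hires is chosen for this example.

```python
import pandas as pd

# Invented join dates for illustration
toy = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6],
    'join_date': pd.to_datetime([
        '2020-01-03', '2020-01-15', '2020-02-01',
        '2020-02-20', '2020-02-28', '2020-03-05',
    ]),
})

# Group by calendar month and count employees in each month
monthly = (
    toy.groupby(toy['join_date'].dt.to_period('M'))['employee_id']
       .count()
       .reset_index(name='hires')
)
print(monthly)
```

The monthly periods may need converting, for example with monthly['join_date'].astype(str), before being passed to px.line.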
promotion.head(2)
| | employee_id | department | region | education | gender | recruitment_channel | no_of_trainings | date_of_birth | age | previous_year_rating | join_date | length_of_service | KPIs_met >80% | awards_won? | avg_training_score | is_promoted |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 65438 | Sales & Marketing | region_7 | Master's & above | Female | sourcing | 1 | 1986-10-21 | 35 | 5.0 | 2013-10-11 | 8 | Yes | No | 49 | No |
| 1 | 65141 | Operations | region_22 | Bachelor's | Male | other | 1 | 1991-11-21 | 30 | 5.0 | 2017-09-19 | 4 | No | No | 60 | No |
data_agg = promotion[promotion['is_promoted']=='Yes']
data_agg = data_agg.groupby(['education']).count()[['employee_id']]
data_agg = data_agg.sort_values(by = 'employee_id')
data_agg
| education | employee_id |
|---|---|
| Below Secondary | 67 |
| Master's & above | 1507 |
| Bachelor's | 3094 |
- px.line()
- px.scatter()
- px.bar()
- px.box()
- px.histogram()

The coursebook covers many aspects of plotting, including using visualization libraries such as plotly.express and other supporting libraries. I hope you've managed to get a good grasp of the plotting philosophy behind plotly.express, and have built a few visualizations with it yourself!
Happy coding!